In [ ]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

In [2]:
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x7f25a302a080>

The aim of this notebook is to make a submission for this competition as quickly as possible. Just want to get something in.

Loading the data

Getting all the filenames for Dog_1 ready, going to do the analysis on this first, then iterate over the rest after that.


In [3]:
cd ..


/home/gavin/repositories/hail-seizure

In [4]:
import train
import json
import imp

First, have to load the settings from the json file:


In [5]:
settings = json.load(open('SETTINGS.json', 'r'))

In [6]:
settings.keys()


Out[6]:
dict_keys(['RAW_DATA_DIRS', 'SUBJECTS', 'MODEL_PATH', 'FEATURES', 'SUBMISSION_PATH', 'DATA_TYPES', 'VERSION', 'TEST_DATA_PATH', 'TRAIN_DATA_PATH'])

In [7]:
data = train.get_data(settings['FEATURES'])

Doing this we get a dictionary of dictionarys:


In [10]:
type(data)


Out[10]:
dict

In [11]:
data.keys()


Out[11]:
dict_keys(['raw_feat_var_', 'raw_feat_pib_', 'raw_feat_corrcoef_', 'raw_feat_cov_', 'raw_feat_xcorr_'])

In [12]:
data['raw_feat_var_'].keys()


Out[12]:
dict_keys(['Patient_2', 'Dog_3', 'Patient_1', 'Dog_1', 'Dog_4', 'Dog_2', 'Dog_5'])

In [13]:
data['raw_feat_var_']['Patient_2'].keys()


Out[13]:
dict_keys(['interictal', 'preictal', 'test'])

In [14]:
data['raw_feat_var_']['Patient_2']['interictal'].keys()


Out[14]:
dict_keys(['Patient_2_interictal_segment_0034.mat', 'Patient_2_interictal_segment_0013.mat', 'Patient_2_interictal_segment_0016.mat', 'Patient_2_interictal_segment_0040.mat', 'Patient_2_interictal_segment_0007.mat', 'Patient_2_interictal_segment_0039.mat', 'Patient_2_interictal_segment_0012.mat', 'Patient_2_interictal_segment_0033.mat', 'Patient_2_interictal_segment_0035.mat', 'Patient_2_interictal_segment_0006.mat', 'Patient_2_interictal_segment_0020.mat', 'Patient_2_interictal_segment_0015.mat', 'Patient_2_interictal_segment_0028.mat', 'Patient_2_interictal_segment_0003.mat', 'Patient_2_interictal_segment_0008.mat', 'Patient_2_interictal_segment_0014.mat', 'Patient_2_interictal_segment_0002.mat', 'Patient_2_interictal_segment_0022.mat', 'Patient_2_interictal_segment_0005.mat', 'Patient_2_interictal_segment_0029.mat', 'Patient_2_interictal_segment_0032.mat', 'Patient_2_interictal_segment_0025.mat', 'Patient_2_interictal_segment_0037.mat', 'Patient_2_interictal_segment_0036.mat', 'Patient_2_interictal_segment_0009.mat', 'Patient_2_interictal_segment_0019.mat', 'Patient_2_interictal_segment_0041.mat', 'Patient_2_interictal_segment_0011.mat', 'Patient_2_interictal_segment_0027.mat', 'Patient_2_interictal_segment_0023.mat', 'Patient_2_interictal_segment_0010.mat', 'Patient_2_interictal_segment_0042.mat', 'Patient_2_interictal_segment_0004.mat', 'Patient_2_interictal_segment_0030.mat', 'Patient_2_interictal_segment_0018.mat', 'Patient_2_interictal_segment_0026.mat', 'Patient_2_interictal_segment_0038.mat', 'Patient_2_interictal_segment_0001.mat', 'Patient_2_interictal_segment_0021.mat', 'Patient_2_interictal_segment_0031.mat', 'Patient_2_interictal_segment_0024.mat', 'Patient_2_interictal_segment_0017.mat'])

It's dictionaries all the way down.

Until you get to the feature vectors, obviously:


In [16]:
data['raw_feat_var_']['Patient_2']['interictal']['Patient_2_interictal_segment_0034.mat']


Out[16]:
array([[ 27997.01668904],
       [ 35989.2985339 ],
       [ 37794.52532364],
       [  6949.26002361],
       [  5195.24778149],
       [  3510.06425946],
       [  2179.78818952],
       [  1385.7960766 ],
       [ 11534.77415212],
       [ 12426.21191639],
       [ 14570.52442261],
       [ 23987.76874976],
       [ 19616.06321446],
       [ 19970.38598875],
       [  3789.33438728],
       [  1386.13668337],
       [  2801.36957988],
       [  4358.93826166],
       [ 19564.32297182],
       [ 22553.67189286],
       [  6853.33525793],
       [  6837.8633967 ],
       [  1925.52227487],
       [  1133.48431202]])

Unfortunately, we want a feature matrix and a target vector to shove into whatever machine learning code we want to use. Should be pretty easy to get that out of the above data structure though. Requirements of this code:

  • Input: subject, features, data
  • Output_: X feature matrix, y target vector

Prototyping this function in this notebook, then will save to utils.py.


In [19]:
import numpy as np

In [51]:
def buildtraining(subject,features,data):
    """Function to build data structures for ML:
    
    * __Input__: subject, features
    * __Output___: X feature matrix, y target vector
    
    It will not tell you which feature is which."""
    # hacking this for later
    first = features[0]
    for feature in features:
        Xf = np.array([])
        # enumerate to get numbers for target vector:
        #     0 is interictal
        #     1 is preictal
        for i,ictal in enumerate(['interictal','preictal']):
            for segment in data[feature][subject][ictal].keys():
                # now stack up the feature vectors
                try:
                    Xf = np.vstack([Xf,data[feature][subject][ictal][segment].T])
                except ValueError:
                    Xf = data[feature][subject][ictal][segment].T
                # and stack up the target vector
                # but only for the first feature (will be the same for the rest)
                if feature == first:
                    try:
                        y.append(i)
                    except NameError:
                        y = [i]
        # stick the X arrays together
        try:
            X = np.hstack([X,Xf])
        except NameError:
            X = Xf
        except ValueError:
            print(feature)
            print(X.shape,Xf.shape)
    # turn y into an array
    y = np.array(y)
    return X,y

How the enumerate works:


In [21]:
for i,x in enumerate(['interictal','preictal']):
    print(i,x)


0 interictal
1 preictal

Testing the above:


In [41]:
X,y = buildtraining('Dog_1',['raw_feat_var_','raw_feat_cov_'],data)

In [43]:
X.shape


Out[43]:
(504, 136)

In [44]:
y.shape


Out[44]:
(504,)

Appears to have worked.

Attempting on all features.


In [52]:
X,y = buildtraining('Dog_1',list(data.keys()),data)


raw_feat_pib_
(504, 16) (504, 16, 6)
raw_feat_xcorr_
(504, 256) (504, 120, 2)

Caught the above errors, looks like those two features are a bit weird. Maybe they're not coming in as vectors?

Should probably just flatten them.


In [13]:
def buildtraining(subject,features,data):
    """Function to build data structures for ML:
    
    * __Input__: subject, features
    * __Output___: X feature matrix, y target vector
    
    It will not tell you which feature is which."""
    # hacking this for later
    first = features[0]
    for feature in features:
        Xf = np.array([])
        # enumerate to get numbers for target vector:
        #     0 is interictal
        #     1 is preictal
        for i,ictal in enumerate(['interictal','preictal']):
            for segment in data[feature][subject][ictal].keys():
                # now stack up the feature vectors
                try:
                    Xf = np.vstack([Xf,np.ndarray.flatten(data[feature][subject][ictal][segment].T)])
                except ValueError:
                    Xf = np.ndarray.flatten(data[feature][subject][ictal][segment].T)
                # and stack up the target vector
                # but only for the first feature (will be the same for the rest)
                if feature == first:
                    try:
                        y.append(i)
                    except NameError:
                        y = [i]
        # stick the X arrays together
        try:
            X = np.hstack([X,Xf])
        except NameError:
            X = Xf
        except ValueError:
            print(feature)
            print(X.shape,Xf.shape)
    # turn y into an array
    y = np.array(y)
    return X,y

In [14]:
X,y = buildtraining('Dog_1',list(data.keys()),data)

Ok, now appears to work.


In [82]:
%save buildtraining.py 56


The following commands were written to file `buildtraining.py`:
def buildtraining(subject,features,data):
    """Function to build data structures for ML:
    
    * __Input__: subject, features
    * __Output___: X feature matrix, y target vector
    
    It will not tell you which feature is which."""
    # hacking this for later
    first = features[0]
    for feature in features:
        Xf = np.array([])
        # enumerate to get numbers for target vector:
        #     0 is interictal
        #     1 is preictal
        for i,ictal in enumerate(['interictal','preictal']):
            for segment in data[feature][subject][ictal].keys():
                # now stack up the feature vectors
                try:
                    Xf = np.vstack([Xf,np.ndarray.flatten(data[feature][subject][ictal][segment].T)])
                except ValueError:
                    Xf = np.ndarray.flatten(data[feature][subject][ictal][segment].T)
                # and stack up the target vector
                # but only for the first feature (will be the same for the rest)
                if feature == first:
                    try:
                        y.append(i)
                    except NameError:
                        y = [i]
        # stick the X arrays together
        try:
            X = np.hstack([X,Xf])
        except NameError:
            X = Xf
        except ValueError:
            print(feature)
            print(X.shape,Xf.shape)
    # turn y into an array
    y = np.array(y)
    return X,y

Learning

Now we can actually do all the machine learning we need to do to make a submission. Not even bothering with feature selection, just going to build a pipeline and run K-fold cross validation.


In [8]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
import sklearn.svm

In [9]:
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('scl',scaler),('clf',forest)])

In [10]:
svc = sklearn.svm.SVC()
modelsvc = sklearn.pipeline.Pipeline([('scl',scaler),('clf',svc)])

Starting with default settings:


In [15]:
tenfold = sklearn.cross_validation.StratifiedKFold(y,n_folds=10)

In [16]:
sklearn.cross_validation.cross_val_score(model,X,y,cv=tenfold)


Out[16]:
array([ 0.94117647,  0.94117647,  0.94117647,  0.94117647,  0.96      ,
        0.96      ,  0.96      ,  0.96      ,  0.96      ,  0.96      ])

In [17]:
X.shape


Out[17]:
(504, 62080)

Not that impressive, when you look at the all zeros for this one.


In [80]:
1-sum(y)/len(y)


Out[80]:
0.95238095238095233

Trying increasing the number of trees we're using.


In [18]:
model.set_params(clf__n_estimators=3000)


Out[18]:
Pipeline(steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=3000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0))])

In [19]:
%%time
sklearn.cross_validation.cross_val_score(model,X,y,cv=tenfold)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 13.8 µs
Out[19]:
array([ 0.94117647,  0.94117647,  0.94117647,  0.94117647,  0.96      ,
        0.96      ,  0.96      ,  0.96      ,  0.96      ,  0.96      ])

In [165]:
sklearn.cross_validation.cross_val_score(modelsvc,X,y,cv=tenfold)


Out[165]:
array([ 0.94117647,  0.94117647,  0.94117647,  0.94117647,  0.96      ,
        0.96      ,  0.96      ,  0.96      ,  0.96      ,  0.96      ])

In [94]:
%%time
sklearn.cross_validation.cross_val_score(model,X,y,cv=tenfold,scoring='roc_auc')


Out[94]:
array([ 0.94444444,  0.72569444,  0.77777778,  0.95833333,  0.9375    ,
        1.        ,  0.84375   ,  0.76041667,  0.546875  ,  0.875     ])

In [166]:
sklearn.cross_validation.cross_val_score(modelsvc,X,y,cv=tenfold,scoring='roc_auc')


Out[166]:
array([ 0.86111111,  0.92361111,  0.93055556,  0.86111111,  0.85416667,
        0.89583333,  0.88541667,  0.84375   ,  0.92708333,  0.96875   ])

Well, that's not below 0.5, so that's good enough for a submission. Time to go ahead and do that.

So, I'll need another function like the one above to create a test matrix for each subject. Then, I can iterate over subjects, training the model and then classifying the test.


In [20]:
def buildtest(subject,features,data):
    """Function to build data structures for submission:
    
    * __Input__: subject, features, data
    * __Output___: X feature matrix, labels
    
    It will not tell you which feature is which."""
    Xd = {}
    for feature in features:
        for segment in data[feature][subject]['test'].keys():
            fvector = np.ndarray.flatten(data[feature][subject]['test'][segment])
            try: 
                Xd[segment] = np.hstack([Xd[segment],fvector])
            except:
                Xd[segment] = fvector
    # make the X array and corresponding labels
    segments = []
    X = []
    for segment in Xd.keys():
        segments.append(segment)
        X.append(Xd[segment])
    X = np.vstack(X)
    return X,segments

In [23]:
features = list(data.keys())
subjects = list(data[features[0]].keys())

Had to remove this feature as it didn't cover all subjects for some reason.


In [22]:
features.remove('raw_feat_xcorr_')

In [24]:
X,segments = buildtest(subjects[0],features,data)

Works, but not saving as I want to reorganise the code I already saved as well.


In [18]:
model.set_params(clf__n_estimators=3000)


Out[18]:
Pipeline(steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=3000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0))])

In [26]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = buildtraining(subj,features,data)
    model.fit(X,y)
    # prediction step
    X,segments = buildtest(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction


CPU times: user 34min 38s, sys: 40.2 s, total: 35min 19s
Wall time: 37min 48s

Running for SVC as well.


In [167]:
%%time
svcpredictiondict = {}
for subj in subjects:
    # training step
    X,y = buildtraining(subj,features,data)
    modelsvc.fit(X,y)
    # prediction step
    X,segments = buildtest(subj,features,data)
    predictions = modelsvc.predict(X)
    for segment,prediction in zip(segments,predictions):
        svcpredictiondict[segment] = prediction


CPU times: user 1.03 s, sys: 3.33 ms, total: 1.04 s
Wall time: 1.04 s

Trying Logistic regression as we now have many more features.


In [32]:
import sklearn.linear_model

In [33]:
logreg = sklearn.linear_model.LogisticRegression()
modellr = sklearn.pipeline.Pipeline([('scl',scaler),('clf',logreg)])

In [36]:
%%time
lrpredictiondict = {}
for subj in subjects:
    # training step
    X,y = buildtraining(subj,features,data)
    modellr.fit(X,y)
    # prediction step
    X,segments = buildtest(subj,features,data)
    predictions = modellr.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        lrpredictiondict[segment] = prediction


CPU times: user 8min 12s, sys: 50.8 s, total: 9min 3s
Wall time: 12min 13s

Saving results to csv

Have to save these results to the csv form requested:


In [27]:
import csv

In [30]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])

In [31]:
!head output/protosubmission.csv












In [154]:
!wc -l output/protosubmission.csv


3936 output/protosubmission.csv

Looks like it's the right length. Submitted now and we got 0.59308 for it, which isn't too bad.

Saving LR as well:


In [37]:
with open("output/protolr.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%lrpredictiondict[seg][-1]])

In [170]:
!head output/protosvc.csv